We load the four datasets and bind them into a single data frame, adding a manufacturer column to each.
library(tidyverse) # provides map2, reduce, read_csv, mutate, as_factor
library(glue)
df <- map2( # map through file and manufacturer names and read dataframes
  c("audi", "bmw", "merc", "vw"), # filename
  c("Audi", "BMW", "Mercedes", "Volkswagen"), # manufacturer
  function(filename, manufacturer) {
    read_csv(glue("./data/{filename}.csv"),
      col_types = "fiififidd"
    ) %>%
      mutate(manufacturer = as_factor(manufacturer)) # add column
  }
) %>%
  reduce(~ bind_rows(.x, .y)) # bind rows into a single dataframe
We then draw a random sample of 5000 observations, fixing the seed for reproducibility.
set.seed(19990428)
df <- df %>%
  slice_sample(n = 5000)
We have 10 variables:
- model
- year
- price
- transmission
- mileage
- fuelType
- tax
- mpg
- engineSize
- manufacturer

Of these variables, 6 are numeric. The figure below shows their distributions as boxplots.
Boxplots of numeric variables in the dataset
The variable year corresponds to a qualitative concept and should therefore be treated as a factor. To complement this change, we add a new numeric variable, age, which is the age of the car. Given that the dataset is from 2020, we compute age = 2020 - year. Additionally, for each numeric variable we add an auxiliary variable that discretizes it into intervals. To simplify the interval labels, the price and mileage values in the auxiliary variables were divided by 1000.
The variable engineSize was converted to a factor, since it can be argued that it is a qualitative concept and there is a finite number of engine sizes in the dataset.
We prefixed the model column with the manufacturer in case there were models with the same name from different manufacturers.
df <- df %>% mutate(
  model = as_factor(paste0(manufacturer, " - ", model)),
  age = 2020 - year,
  aux_price = cut_number(price / 1000, 4),
  aux_mileage = cut_number(mileage / 1000, 4),
  aux_mpg = cut_number(mpg, 4),
  aux_tax = cut_number(tax, 2),
  aux_age = cut_number(age, 4),
  year = as_factor(year),
  engineSize = as_factor(engineSize)
)
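The auxiliary variables rely on equal-frequency binning. As a rough base-R sketch of the idea behind `cut_number(x, 4)`, the snippet below cuts a synthetic vector (not the car data) at its quartiles. The helper name `cut_equal_freq` and the simulated values are assumptions made for illustration only.

```r
# Hypothetical helper: split x into n_bins intervals holding roughly equal counts.
cut_equal_freq <- function(x, n_bins) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

set.seed(1)
mileage_demo <- runif(1000, min = 0, max = 168000) # synthetic values, not the real data
bins <- cut_equal_freq(mileage_demo / 1000, 4)     # same division by 1000 as in the text
table(bins)
```

With heavily tied values (as for tax), the bins cannot be balanced, which is consistent with the uneven counts later reported for aux_tax.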
The first table below summarises the numeric variables. The two tables after it summarise the categorical variables (excluding model and engineSize) and the auxiliary interval variables.
| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| price | 5000 | 21571.45 | 11544.02 | 1295.0 | 13990.0 | 19498.0 | 26030.0 | 149948.0 |
| mileage | 5000 | 23054.10 | 22309.69 | 1.0 | 5904.0 | 16500.0 | 33297.0 | 168000.0 |
| tax | 5000 | 123.60 | 62.56 | 0.0 | 125.0 | 145.0 | 145.0 | 570.0 |
| mpg | 5000 | 54.19 | 18.11 | 1.1 | 45.6 | 53.3 | 61.4 | 470.8 |
| age | 5000 | 2.78 | 2.10 | 0.0 | 1.0 | 3.0 | 4.0 | 19.0 |
| Variable | Level | N | % |
|---|---|---|---|
| transmission | Manual | 1784 | 35.7 |
| | Automatic | 1332 | 26.6 |
| | Semi-Auto | 1884 | 37.7 |
| | Other | 0 | 0.0 |
| fuelType | Petrol | 2065 | 41.3 |
| | Diesel | 2860 | 57.2 |
| | Hybrid | 65 | 1.3 |
| | Other | 10 | 0.2 |
| | Electric | 0 | 0.0 |
| manufacturer | Audi | 1072 | 21.4 |
| | BMW | 1106 | 22.1 |
| | Mercedes | 1340 | 26.8 |
| | Volkswagen | 1482 | 29.6 |
| Variable | Level | N | % |
|---|---|---|---|
| aux_price | [1.29,14] | 1254 | 25.1 |
| | (14,19.5] | 1252 | 25.0 |
| | (19.5,26] | 1244 | 24.9 |
| | (26,150] | 1250 | 25.0 |
| aux_mileage | [0.001,5.9] | 1251 | 25.0 |
| | (5.9,16.5] | 1252 | 25.0 |
| | (16.5,33.3] | 1247 | 24.9 |
| | (33.3,168] | 1250 | 25.0 |
| aux_mpg | [1.1,45.6] | 1338 | 26.8 |
| | (45.6,53.3] | 1291 | 25.8 |
| | (53.3,61.4] | 1188 | 23.8 |
| | (61.4,471] | 1183 | 23.7 |
| aux_tax | [0,145] | 3969 | 79.4 |
| | (145,570] | 1031 | 20.6 |
| aux_age | [0,1] | 1888 | 37.8 |
| | (1,3] | 1453 | 29.1 |
| | (3,4] | 871 | 17.4 |
| | (4,19] | 788 | 15.8 |
There are 88 different models and 29 different engine sizes. The two figures below show the distribution of engineSize and the 15 most common car models in our sample.
Distribution of engine sizes in the sample
Most popular car models
If we count the number of NA values per variable, we find that there are no explicit NA in the sample, as the following table shows. The table also reports the number of zeros in each variable:
| Variable | Missing | Zeros |
|---|---|---|
| model | 0 | 0 |
| year | 0 | 0 |
| price | 0 | 0 |
| transmission | 0 | 0 |
| mileage | 0 | 0 |
| fuelType | 0 | 0 |
| tax | 0 | 152 |
| mpg | 0 | 0 |
| engineSize | 0 | 13 |
| manufacturer | 0 | 0 |
To find severe outliers, for each numeric variable we compute the IQR and flag the values that fall outside the range \((Q_1 - 3\,\mathrm{IQR},\ Q_3 + 3\,\mathrm{IQR})\).
The following table shows how many individuals have 0, 1, 2 or 3 severe outliers (there are no individuals with more than 3).
| n_outliers | count |
|---|---|
| 0 | 3662 |
| 1 | 1285 |
| 2 | 51 |
| 3 | 2 |
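The 3 × IQR rule described above can be sketched in base R as follows. The helper name `is_severe_outlier` and the toy vector are hypothetical; only the tax quartiles (Q1 = 125, Q3 = 145) are taken from the summary table of numeric variables.

```r
# Hypothetical helper: TRUE where a value lies outside (Q1 - 3*IQR, Q3 + 3*IQR).
is_severe_outlier <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - 3 * iqr | x > q[[2]] + 3 * iqr
}

toy <- c(1:10, 100) # toy vector, not the car data
flags <- is_severe_outlier(toy)
which(flags)        # position of the flagged value

# Fences implied by the reported tax quartiles (Q1 = 125, Q3 = 145, so IQR = 20).
tax_fences <- c(lower = 125 - 3 * (145 - 125), upper = 145 + 3 * (145 - 125))
tax_fences
```

Because the tax IQR is so narrow, any tax value below 65 or above 205 would be flagged under this rule.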
The two cars with 3 outliers are shown in the next table; both have outliers in tax, mileage and age. The table after it gives the number of severe outliers per variable.
| n_outliers | model | year | mileage | tax | age |
|---|---|---|---|---|---|
| 3 | Mercedes - C Class | 2004 | 119000 | 300 | 16 |
| 3 | Mercedes - M Class | 2004 | 121000 | 325 | 16 |
| price | mileage | tax | mpg | age |
|---|---|---|---|---|
| 49 | 26 | 1251 | 54 | 13 |
The table above shows that tax has a very high number of severe outliers. The density plot of the variable, shown below, reveals that most of the values are around 145 and that all the other peaks are labelled as severe outliers because the IQR is only 20. There is clearly a group of cars that pay lower taxes; this may be correlated with other variables such as fuelType or engineSize.
Tax density plot with IQR
To detect multivariate outliers, we use Moutlier from the chemometrics package. We found 156 multivariate outliers. The following table lists the 10 individuals with the largest Mahalanobis distance.
| model | age | price | mileage | tax | mpg | fuelType | engineSize | transmission | moutlier_md |
|---|---|---|---|---|---|---|---|---|---|
| BMW - i3 | 3 | 19495 | 17338 | 135 | 470.8 | Hybrid | 0 | Automatic | 24.60064 |
| BMW - i3 | 3 | 20000 | 19178 | 0 | 470.8 | Other | 0.6 | Automatic | 24.59941 |
| BMW - i3 | 3 | 17600 | 50867 | 135 | 470.8 | Other | 0.6 | Automatic | 24.35757 |
| Mercedes - SL CLASS | 9 | 149948 | 3000 | 570 | 21.4 | Petrol | 6.2 | Automatic | 15.74446 |
| Mercedes - C Class | 18 | 1495 | 13800 | 305 | 39.8 | Diesel | 2.7 | Automatic | 12.14130 |
| Mercedes - G Class | 1 | 139948 | 12000 | 145 | 21.4 | Petrol | 4 | Automatic | 11.90107 |
| Mercedes - A Class | 1 | 140319 | 785 | 150 | 22.1 | Petrol | 4 | Semi-Auto | 11.84491 |
| Audi - R8 | 0 | 137995 | 70 | 145 | 21.1 | Petrol | 5.2 | Semi-Auto | 11.39794 |
| BMW - X5 | 1 | 72990 | 4799 | 140 | 188.3 | Hybrid | 3 | Semi-Auto | 10.67216 |
| Audi - R8 | 1 | 125000 | 100 | 145 | 24.1 | Petrol | 5.2 | Automatic | 10.22799 |
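To make the distance in the last column more concrete, here is a minimal sketch that uses the classical base-R `mahalanobis` function on synthetic data. It is a swapped-in technique, not a reproduction of the Moutlier call, so its numbers would not match the table; the 97.5% chi-square cutoff and the planted extreme row are assumptions for illustration.

```r
set.seed(42)
# Synthetic stand-in for some numeric columns; the last row is a planted extreme point.
X <- cbind(price = rnorm(200, 20000, 5000),
           mileage = rnorm(200, 25000, 8000),
           mpg = rnorm(200, 55, 8))
X <- rbind(X, c(20000, 25000, 470))

# mahalanobis() returns squared distances, so take the square root.
md <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))
cutoff <- sqrt(qchisq(0.975, df = ncol(X))) # assumed cutoff, not taken from the report
outlier_rows <- which(md > cutoff)
head(order(md, decreasing = TRUE), 10)      # rows with the largest distances
```

Note that the classical mean and covariance are themselves pulled towards extreme points, which is one reason robust variants are often preferred for outlier detection.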
There were only 3 electric cars in the original dataset before sampling, and our sample contains none. However, there were cars with engineSize 0, as shown in the table below. Since they were not classified as Other, we decided that this was erroneous data that should be imputed.
| fuelType | n |
|---|---|
| Petrol | 5 |
| Diesel | 7 |
| Hybrid | 1 |
The histogram of the price shows a strongly right-skewed distribution, which does not seem compatible with a normal fit. Moreover, the Shapiro-Wilk test returns a p-value of less than \(2.2\times10^{-16}\), which leads us to reject the null hypothesis of normality.
Additionally, we checked whether the price follows a log-normal distribution. The histogram of the transformed variable shows that the log transformation corrects the skewness, and the new distribution resembles a normal bell shape. However, the Shapiro-Wilk test again returns a p-value of less than \(2.2\times10^{-16}\), so we also reject the null hypothesis of normality for log(price).
The figure below shows the QQ plots of price and log(price). We can see that price clearly does not follow a normal distribution and that log(price) is heavy-tailed.
QQ plots
We perform a Durbin-Watson test with the null hypothesis that the autocorrelation of the disturbances is 0. We obtain a p-value of 0.95 so we fail to reject the null hypothesis.
The results of the test are consistent with the visual interpretation of the ACF plot[^1] shown below. All the values except lag = 33 lie within the 95% confidence band, which indicates no significant autocorrelation.
ACF plot for price
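As a rough illustration of the two checks above, the following base-R sketch computes the Durbin-Watson statistic by hand and lists the sample autocorrelations that fall outside the approximate 95% band. The data are synthetic independent draws, so the names and values are assumptions; the p-value reported in the text would require a dedicated test function (for example `dwtest` from the lmtest package).

```r
set.seed(7)
price_demo <- rlnorm(1000, meanlog = 10, sdlog = 0.5) # synthetic, independent draws
res <- residuals(lm(price_demo ~ 1))

# Durbin-Watson statistic: values near 2 indicate no first-order autocorrelation.
dw <- sum(diff(res)^2) / sum(res^2)

# Sample autocorrelations with approximate 95% bounds of +/- 1.96 / sqrt(n).
acf_vals <- acf(price_demo, lag.max = 35, plot = FALSE)$acf[-1] # drop lag 0
bound <- qnorm(0.975) / sqrt(length(price_demo))
which(abs(acf_vals) > bound) # lags outside the band, if any
```

With 35 lags and a 95% band, one or two lags can be expected to fall outside the band by chance alone, which is how a single excursion such as lag 33 is usually read.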
Since we determined that price does not follow a normal distribution, we compute the correlation matrix using the Spearman coefficient. The correlation plot below shows that the numerical variables most associated with price are age, mileage and mpg. Surprisingly, tax has the lowest correlation coefficient. The specific values of the correlation matrix are given in the table that follows the plot.
Spearman correlation plot
| | price | mileage | tax | mpg | age |
|---|---|---|---|---|---|
| price | 1.00 | -0.64 | 0.39 | -0.56 | -0.69 |
| mileage | -0.64 | 1.00 | -0.25 | 0.43 | 0.85 |
| tax | 0.39 | -0.25 | 1.00 | -0.59 | -0.29 |
| mpg | -0.56 | 0.43 | -0.59 | 1.00 | 0.41 |
| age | -0.69 | 0.85 | -0.29 | 0.41 | 1.00 |
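The reason for preferring Spearman here is that it depends only on ranks, so it captures monotone but non-linear relationships. A minimal sketch with toy vectors (hypothetical, not the car data):

```r
x <- as.numeric(1:10)
y <- exp(x) # monotone but strongly non-linear in x

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman") # depends only on the ranks
c(pearson = pearson, spearman = spearman)

# For a data frame of numeric columns (here a tiny toy one), the full matrix is one call.
toy <- data.frame(x = x, y = y, z = rev(x))
round(cor(toy, method = "spearman"), 2)
```

The Spearman coefficient is exactly 1 for the monotone pair, while the Pearson coefficient is noticeably lower.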
| Variable | R2 |
|---|---|
| model | 0.52 |
| year | 0.33 |
| engineSize | 0.48 |
| transmission | 0.21 |
| manufacturer | 0.08 |
| fuelType | 0.00 |
Using the condes function from FactoMineR, we computed the association between price and the qualitative variables, as shown in the table above. The most relevant qualitative variable is model, closely followed by engineSize and then year (which agrees with the results for the numerical variable age). Transmission is somewhat less relevant, and manufacturer and fuelType have almost no relevance.
The variables most associated with our response variable are (in decreasing order of importance):
- model
- engineSize
- year / age
- mileage
- mpg

We start by checking the ANOVA assumptions of normality and homogeneity of variance.
Boxplot of price by age group
The Fligner-Killeen test returns a p-value of less than \(2.2\times10^{-16}\), which leads us to reject the null hypothesis of homogeneity of variance. Additionally, we cannot assume normality. For this reason we use the non-parametric Kruskal-Wallis test.
The test returns a p-value of less than \(2.2\times10^{-16}\), which is below the significance level, so we reject the null hypothesis that the location parameters of all the samples are equal. We have statistical evidence that the price depends on age. Visual inspection of the boxplots above is consistent with this result.
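The sequence of tests above can be sketched with base R's `fligner.test` and `kruskal.test`. The three groups and their simulated prices are hypothetical and only stand in for the age groups of the report.

```r
set.seed(3)
# Synthetic prices for three hypothetical age groups with different locations and spreads.
demo <- data.frame(
  age_group = factor(rep(c("new", "mid", "old"), each = 100),
                     levels = c("new", "mid", "old")),
  price = c(rlnorm(100, 10.3, 0.5), rlnorm(100, 9.9, 0.4), rlnorm(100, 9.4, 0.3))
)

fk <- fligner.test(price ~ age_group, data = demo) # H0: equal variances across groups
kw <- kruskal.test(price ~ age_group, data = demo) # H0: same location in every group
c(fligner_p = fk$p.value, kruskal_p = kw$p.value)
```

Both functions accept a formula of the form response ~ group, so the same call pattern applies to any of the factors in the dataset.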
The two-way ANOVA on age and fuelType shows that both factors are significant (p-value < 0.05), as is their interaction.
| aux_age | fuelType | count | mean | sd |
|---|---|---|---|---|
| [0,1] | Petrol | 902 | 27176.09 | 13688.34 |
| [0,1] | Diesel | 959 | 30772.99 | 10482.14 |
| [0,1] | Hybrid | 24 | 35200.71 | 12473.14 |
| [0,1] | Other | 3 | 14496.33 | 1500.00 |
| (1,3] | Petrol | 653 | 18679.41 | 9526.78 |
| (1,3] | Diesel | 775 | 21369.43 | 7488.77 |
| (1,3] | Hybrid | 20 | 26032.10 | 12783.02 |
| (1,3] | Other | 5 | 19985.80 | 2510.81 |
| (3,4] | Petrol | 275 | 14322.52 | 7491.42 |
| (3,4] | Diesel | 580 | 16356.18 | 4564.75 |
| (3,4] | Hybrid | 15 | 17874.60 | 4565.66 |
| (3,4] | Other | 1 | 24500.00 | NA |
| (4,19] | Petrol | 235 | 11796.59 | 10647.90 |
| (4,19] | Diesel | 546 | 12725.60 | 4697.63 |
| (4,19] | Hybrid | 6 | 19423.83 | 11920.52 |
| (4,19] | Other | 1 | 10489.00 | NA |
We run the Fligner-Killeen test for each factor and for the interaction of both. The resulting p-values are 0.056 for fuelType, less than \(2.2\times10^{-16}\) for age, and less than \(2.2\times10^{-16}\) for the two combined. For age and age:fuelType, there is clear evidence to reject the null hypothesis of equal variances across groups. The result when grouping by fuelType is less conclusive, as the p-value is slightly above the significance level.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 29687.28 | 229.64 | 129.28 | 0 |
| age | -2921.89 | 65.98 | -44.28 | 0 |
| statistic | value |
|---|---|
| Residual standard error | 9784.206 |
| Residual degrees of freedom | 4998 |
| Multiple R-squared | 0.281792 |
| Adjusted R-squared | 0.2816483 |
| F-statistic | 1960.987 (on 1 and 4998 DF) |
Linear model price \(\sim\) age
Model residuals
The figure above shows the linear regression model price ~ age; its parameters and statistics are given in the tables above. A simple visual analysis shows that at around 10 years the predicted price becomes negative, which makes no sense in the real world. The fit is clearly dominated by the larger amount of data at low age values. Overall, this is a very poor fit.
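The age at which the fitted line becomes negative follows directly from the two coefficients in the table above. A short worked computation:

```r
# Coefficients copied from the coefficient table of the price ~ age model.
intercept <- 29687.28
slope_age <- -2921.89

# Age at which the fitted line crosses zero: intercept + slope_age * age = 0.
zero_age <- -intercept / slope_age
round(zero_age, 2)

# Fitted price at ages 0, 5, 10 and 15 years.
intercept + slope_age * c(0, 5, 10, 15)
```

The crossing lies just past 10 years, well inside the observed age range (the oldest car in the sample is 19 years old).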
The residual plot suggests that homoscedasticity does not hold. A Breusch-Pagan test gives a p-value of 0.0013, which is below 0.05, so we reject the null hypothesis of homoscedasticity. This is consistent with the residual plots, which suggest both non-linearity and heteroscedasticity.
According to our model, age explains 0.2816483 (about 28%) of the price variability.
Residuals of model with quadratic term
The new model, which adds a quadratic age term, explains 0.3175489 of the price variance, an improvement over the previous one. Moreover, the quadratic age term appears to be relevant: testing whether its coefficient is equal to zero yields a very small p-value.
Additionally, we compared the previous model with the new one using ANOVA. The resulting small p-value leads us to reject the null hypothesis that the quadratic term does not improve the fit, so the new model is a significant improvement on the previous one.
Nevertheless, there is still a clear pattern of heteroscedasticity in the residuals, which the Breusch-Pagan test confirms (with p-value < 0.05 we reject the null hypothesis of homoscedasticity).
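The comparison between the linear and the quadratic model can be sketched as a nested-model F test with `anova`. The data below are simulated with made-up coefficients, so the output only illustrates the mechanics, not the report's numbers.

```r
set.seed(11)
age <- runif(500, 0, 15)
# Synthetic prices with a convex age effect; the coefficients are made up for illustration.
price <- 30000 - 3000 * age + 100 * age^2 + rnorm(500, sd = 2000)

m_lin  <- lm(price ~ age)
m_quad <- lm(price ~ age + I(age^2))

cmp <- anova(m_lin, m_quad) # F test for the extra quadratic term
p_quad <- cmp[["Pr(>F)"]][2]
c(r2_linear = summary(m_lin)$r.squared,
  r2_quadratic = summary(m_quad)$r.squared,
  p_value = p_quad)
```

The `I(age^2)` wrapper is needed so that the formula treats the square as arithmetic rather than as a formula operator.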
| | VIF |
|---|---|
| mileage | 2.62 |
| tax | 1.25 |
| mpg | 1.25 |
| age | 2.54 |
| | VIF |
|---|---|
| tax | 1.24 |
| mpg | 1.24 |
| age | 1.04 |
A variance inflation factor (VIF) analysis shows that age and mileage have the highest values. Examining their relationship, the two variables appear to be linearly correlated, as shown in the figure below. Additionally, mileage has the largest p-value in our model. Given these two facts, we consider dropping mileage from the model.
Collinearity between age and mileage
The new model without mileage has almost the same R-squared value, and the results of the VIF analysis are much more reasonable. There seems to be a small correlation between tax and mpg, but it is not a concern given the small VIF values.
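For reference, the VIF of a predictor is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the remaining ones. The sketch below computes it by hand on synthetic data; the helper `vif_manual` is hypothetical, and packages such as car provide a ready-made `vif` function.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), regressing predictor j on the others.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

set.seed(5)
age <- runif(300, 0, 15)
demo <- data.frame(
  age = age,
  mileage = 8000 * age + rnorm(300, sd = 15000), # synthetic, built to track age
  tax = rnorm(300, 125, 60)
)
vif_values <- vif_manual(demo)
round(vif_values, 2)
```

Two predictors that track each other inflate each other's VIF, while an unrelated predictor stays close to 1.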
Performing an analysis of covariance between the model and all the available factors, we obtain that the additive effect of each of them on the price is statistically significant (p-value is less than \(2.2\times10^{-16}\) in all the cases).
The best model obtained so far is the one that includes all the numerical variables and the quadratic term in age.
| | Coefficient |
|---|---|
| (Intercept) | 31707.57 |
| mileage | -0.06 |
| tax | 30.95 |
| mpg | -102.71 |
| age | -2947.53 |
| I(age^2) | 92.47 |
The intercept tells us that the expected initial value of a new car is around 32000 pounds. For every mile, the price drops by 0.0576349. For each pound of tax, the value increases by 30.948643. Contrary to what one might expect, miles per gallon (mpg) has a negative effect on the price of the car (the price drops by 102.7146839 per additional mile per gallon). This may be caused by the extreme outliers in the mpg variable, which are the BMW - i3, a very expensive car with hybrid technology that uses petrol to charge the electric batteries and extend its range.
The price of the car drops by about 2947.53 for each year of age. Note that with this slope alone the price would be negative at around 10 years; this is compensated by the \(age^2\) term. However, it also means that the model does not translate well to cars much older than the ones in our sample: since \(age^2\) grows faster than age, there is a point beyond which the predicted price starts to increase as the car gets older. This may be valid in some cases with vintage cars, but in general common sense dictates that the price should approach a base value close to 0 as age tends to infinity.
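The age at which the age-related effect stops decreasing can be worked out from the two age coefficients in the table above, holding the other predictors fixed:

```r
# Age coefficients copied from the coefficient table above.
b_age  <- -2947.53
b_age2 <- 92.47

# d(price)/d(age) = b_age + 2 * b_age2 * age, which is zero at the turning point.
turning_age <- -b_age / (2 * b_age2)
round(turning_age, 2)

# Age-related contribution to the price at a few ages (other predictors held fixed).
ages <- c(0, 5, 10, 16, 25)
b_age * ages + b_age2 * ages^2
```

The turning point falls at roughly 16 years, which is inside the observed range (maximum age 19), so the upturn already affects the oldest cars in the sample.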
log-Likelihood plot
If we compute the Box-Cox transformation, as shown in the log-likelihood plot above, we obtain a lambda of 0.020202. The plot shows that 0 is inside the confidence interval, indicating that a log transformation of the price is appropriate.
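A minimal sketch of how such a lambda and its interval can be read off the profile log-likelihood, using `boxcox` from the MASS package on synthetic data. The simulated prices, the lambda grid and the likelihood-ratio cutoff used for the interval are assumptions for illustration, not the report's exact call.

```r
library(MASS) # ships with standard R distributions

set.seed(21)
age <- runif(1000, 0, 15)
# Synthetic prices that are log-linear in age, so lambda close to 0 is expected.
price <- exp(10.3 - 0.15 * age + rnorm(1000, sd = 0.5))

bc <- boxcox(lm(price ~ age), lambda = seq(-2, 2, length.out = 100), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)] # grid value with the highest profile log-likelihood

# Approximate 95% interval: lambdas whose log-likelihood is within qchisq(0.95, 1) / 2 of the max.
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
c(lambda_hat = lambda_hat, lower = ci[1], upper = ci[2])
```

A lambda of 0 corresponds to the log transformation, so an estimate close to 0 supports modelling log(price).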
Model residuals
With the logarithm of the price as the response, we obtain a higher value of \(R^2=0.568219\). The residuals of the model are shown in the figure above.
If we add all the factor variables to the model (except the auxiliary ones), we obtain a model that explains 0.9345831 of the variability. The factor variable model has great influence: if we remove it, the model explains only 0.8837573. This makes sense, given that we expect cars of the same model to have similar prices; model is also by far the factor with the most levels.
The best model obtained so far is the one using the log transformation on price and all the numerical variables and factors.
Given what we found about the collinearity of mileage and age, we consider the model without mileage. We also found that the numeric variable tax is not statistically significant once all the factors are added. We evaluated the \(R^2\) and BIC of the full model, the model without mileage, the model without tax, and the model without both. The results are shown in the table below. The best model is the original one with both tax and mileage (it has the highest \(R^2\) and the lowest BIC).
| model | R2 | BIC |
|---|---|---|
| base | 0.9330 | -5514.386 |
| -mileage | 0.9102 | -4060.839 |
| -tax | 0.9327 | -5500.504 |
| -mileage, -tax | 0.9102 | -4067.909 |
The figure below shows the residuals of the best model obtained so far. There are no clear patterns in the residuals. Some residuals are still clear outliers, and the QQ plot shows that the distribution of the residuals is heavy-tailed.
Residuals of log(price) model
Studentized residuals outliers
The figure above shows the studentized residuals with the severe outliers labelled. The corresponding observations are listed in the table below.
| rowid | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | age | stud_resids |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 225 | Mercedes - C Class | 2002 | 1495 | Automatic | 13800 | Diesel | 305 | 39.8 | 2.7 | 18 | -11.78 |
| 1389 | Mercedes - A Class | 2019 | 140319 | Semi-Auto | 785 | Petrol | 150 | 22.1 | 4 | 1 | 6.68 |
| 2383 | BMW - Z4 | 2008 | 14000 | Manual | 63000 | Petrol | 325 | 31.7 | 3 | 12 | 4.50 |
| 2784 | Mercedes - C Class | 2004 | 1495 | Manual | 119000 | Petrol | 300 | 34.5 | 1.8 | 16 | -5.68 |
| 3040 | Mercedes - GLE Class | 2016 | 7750 | Semi-Auto | 77456 | Diesel | 235 | 42.8 | 3 | 4 | -8.92 |
| 3078 | Audi - A1 | 2016 | 8695 | Manual | 30000 | Petrol | 240 | 39.8 | 2 | 4 | -4.30 |
| 3712 | BMW - 7 Series | 2007 | 5200 | Automatic | 83000 | Diesel | 325 | 34.4 | 2.5 | 13 | -5.77 |
| 3767 | Mercedes - A Class | 2018 | 89990 | Automatic | 6800 | Petrol | 145 | 24.8 | 4 | 2 | 4.34 |
| 3817 | Volkswagen - Golf | 2011 | 14999 | Manual | 61422 | Petrol | 305 | 33.2 | 2 | 9 | 5.32 |
| 4276 | Mercedes - A Class | 2017 | 79999 | Semi-Auto | 13781 | Petrol | 145 | 30.1 | 4 | 3 | 4.48 |
| 4508 | Mercedes - A Class | 2010 | 1350 | Manual | 116126 | Diesel | 145 | 54.3 | 2 | 10 | -11.15 |
| 4639 | Mercedes - M Class | 2004 | 19950 | Automatic | 121000 | Diesel | 325 | 29.7 | 2.7 | 16 | 13.93 |
| 4869 | Volkswagen - Passat | 2010 | 1495 | Manual | 168000 | Diesel | 125 | 60.1 | 2 | 10 | -6.65 |
In the initial analysis of the data, we identified 156 multivariate outliers using the Mahalanobis distance. The list of all the indices is shown below:
The figure below shows the influential data according to DFBETAS for the different numerical variables, as well as Cook's distance. Since we have a large sample of 5000 observations, we used a cutoff of 0.5.
Influential data with DFBETAS
The plot below shows the DFFIT metric for the observations in the dataset. The labels correspond to values above 1. Most of the influential values found using DFBETAS are also influential according to DFFIT.
Influential data with DFFIT
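The influence measures used above are all available in base R for a fitted `lm`. The following sketch applies them to synthetic data with one planted anomalous observation; the data, the planted point and the way the two cutoffs are combined are assumptions made for illustration.

```r
set.seed(9)
n <- 200
age <- runif(n, 0, 15)
age[1] <- 15                                        # put the planted point at the edge of the range
log_price <- 10.3 - 0.12 * age + rnorm(n, sd = 0.3) # synthetic data
log_price[1] <- log_price[1] - 4                    # make the planted point anomalous

fit <- lm(log_price ~ age)
dfb  <- dfbetas(fit)        # one column per coefficient
dff  <- dffits(fit)
cook <- cooks.distance(fit)

# Cutoffs assumed for this sketch: 0.5 for |DFBETAS| and 1 for |DFFITS|.
influential <- which(apply(abs(dfb) > 0.5, 1, any) | abs(dff) > 1)
influential
```

DFBETAS measures the change in each coefficient when an observation is left out, whereas DFFITS measures the change in that observation's own fitted value, so the two do not always flag the same rows.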
| rowid | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | age | Moutlier |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 134 | Volkswagen - Caddy Life | 2017 | 19995 | Manual | 15860 | Petrol | 145 | 56.5 | 2 | 3 | FALSE |
| 225 | Mercedes - C Class | 2002 | 1495 | Automatic | 13800 | Diesel | 305 | 39.8 | 2.7 | 18 | TRUE |
| 830 | Mercedes - X-CLASS | 2017 | 31994 | Automatic | 24800 | Diesel | 260 | 35.8 | 2.3 | 3 | FALSE |
| 1389 | Mercedes - A Class | 2019 | 140319 | Semi-Auto | 785 | Petrol | 150 | 22.1 | 4 | 1 | TRUE |
| 1995 | Volkswagen - Caravelle | 2006 | 14495 | Manual | 106000 | Diesel | 325 | 34.4 | 2.5 | 14 | TRUE |
| 2383 | BMW - Z4 | 2008 | 14000 | Manual | 63000 | Petrol | 325 | 31.7 | 3 | 12 | TRUE |
| 2784 | Mercedes - C Class | 2004 | 1495 | Manual | 119000 | Petrol | 300 | 34.5 | 1.8 | 16 | TRUE |
| 2942 | Audi - Q5 | 2019 | 44790 | Automatic | 5886 | Petrol | 135 | 117.7 | 2 | 1 | TRUE |
| 2995 | Audi - A8 | 2020 | 78990 | Automatic | 250 | Diesel | 145 | 39.2 | 3 | 0 | TRUE |
| 3020 | Mercedes - M Class | 2011 | 7995 | Automatic | 131000 | Diesel | 555 | 31.0 | 3 | 9 | TRUE |
| 3040 | Mercedes - GLE Class | 2016 | 7750 | Semi-Auto | 77456 | Diesel | 235 | 42.8 | 3 | 4 | FALSE |
| 3123 | Volkswagen - Caddy Life | 2019 | 17995 | Manual | 2156 | Diesel | 150 | 51.4 | 2 | 1 | FALSE |
| 3225 | Volkswagen - Shuttle | 2017 | 32995 | Semi-Auto | 4828 | Diesel | 145 | 47.1 | 2 | 3 | FALSE |
| 3339 | Mercedes - GL Class | 2014 | 24498 | Automatic | 67833 | Diesel | 325 | 35.3 | 3 | 6 | FALSE |
| 3410 | Mercedes - GL Class | 2015 | 31998 | Semi-Auto | 36281 | Diesel | 330 | 36.2 | 3 | 5 | FALSE |
| 3637 | Mercedes - CLA Class | 2020 | 54900 | Automatic | 3600 | Petrol | 145 | 33.2 | 2 | 0 | FALSE |
| 3712 | BMW - 7 Series | 2007 | 5200 | Automatic | 83000 | Diesel | 325 | 34.4 | 2.5 | 13 | TRUE |
| 4090 | BMW - i3 | 2017 | 19495 | Automatic | 17338 | Hybrid | 135 | 470.8 | 0.6 | 3 | TRUE |
| 4508 | Mercedes - A Class | 2010 | 1350 | Manual | 116126 | Diesel | 145 | 54.3 | 2 | 10 | FALSE |
| 4515 | Audi - A6 | 2011 | 6495 | Automatic | 94700 | Diesel | 235 | 44.1 | 2.7 | 9 | FALSE |
| 4555 | BMW - X5 | 2019 | 72990 | Semi-Auto | 4799 | Hybrid | 140 | 188.3 | 3 | 1 | TRUE |
| 4639 | Mercedes - M Class | 2004 | 19950 | Automatic | 121000 | Diesel | 325 | 29.7 | 2.7 | 16 | TRUE |
| 4869 | Volkswagen - Passat | 2010 | 1495 | Manual | 168000 | Diesel | 125 | 60.1 | 2 | 10 | TRUE |
| 4938 | Audi - TT | 2016 | 39995 | Semi-Auto | 16000 | Petrol | 300 | 34.0 | 2.5 | 4 | FALSE |
The table above lists all the observations flagged as influential by either DFFIT or DFBETAS. The column Moutlier indicates which observations were labelled as multivariate outliers a priori. In 11 of the 24 cases, the influential observation was not among the multivariate outliers detected a priori.
We refit the model from the previous sections after removing all the influential data points identified above.
| model | transmission | fuelType | engineSize | manufacturer | mileage | tax | mpg | age |
|---|---|---|---|---|---|---|---|---|
| BMW - 2 Series | Semi-Auto | Diesel | 2 | BMW | 23054.1 | 123.6 | 54.19 | 5 |
For the car described in the table above, we obtain an expected price of 14556.53 with a 95% confidence interval of (14203.66, 14918.17).
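Because the model is fitted on log(price), a prediction has to be mapped back to the price scale. A minimal sketch with a small synthetic model follows; the data, the formula and the `new_car` values are hypothetical and much simpler than the report's model.

```r
set.seed(13)
n <- 300
demo <- data.frame(age = runif(n, 0, 15), mpg = rnorm(n, 54, 10))
# Synthetic prices, log-linear in age and mpg.
demo$price <- exp(10.3 - 0.12 * demo$age - 0.005 * demo$mpg + rnorm(n, sd = 0.25))

fit <- lm(log(price) ~ age + mpg, data = demo)
new_car <- data.frame(age = 5, mpg = 54.19) # hypothetical car to predict

# Interval for the expected log(price), then mapped back to the price scale with exp().
pred_log <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
pred_price <- exp(pred_log)
pred_price
```

After the exponential back-transformation the interval is no longer symmetric around the point estimate, which matches the interval reported above, whose upper half is slightly wider than its lower half.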
We first realized that the price does not follow a normal distribution, because there are several luxury cars with high prices. This affected the modelling phase: linear models fitted to the untransformed price performed much worse than those fitted to its logarithm.
In general, we also learned that different levels of the same factor had significantly different price means and variances. Moreover, we found some errors, such as non-electric cars with engine size equal to zero, and some instances with many severe outliers. When we looked up the models that were outliers, we found that they were indeed quite peculiar cars. We also showed that removing these rows from the analysis improved the quality of our models.
All variables seem to be important for predicting the price except tax and mileage. Mileage is useful on its own, but its correlation with age makes it redundant. Tax has a high concentration of values and therefore has little discriminatory power. We were also rather surprised to find that our best model could explain so much of the price variance with the limited number of features available.
We were also surprised by the behaviour of price with respect to age. As expected, the price tended to decrease rapidly in the first years and then quickly flattened out. We had nevertheless assumed that at some point, as cars become vintage, the price would increase slightly. However, our data does not seem to show this pattern.
[^1]: Lag 0 is omitted for clarity.